Positive-unlabeled learning for disease gene identification

Authors

Peng Yang

Xiaoli Li

Jian-Ping Mei

Chee Keong Kwoh

See-Kiong Ng

Abstract

BACKGROUND Identifying disease genes from the human genome is an important but challenging task in biomedical research. Machine learning methods can be applied to discover new disease genes based on the known ones. Existing machine learning methods typically use the known disease genes as the positive training set P and the unknown genes as the negative training set N (a confirmed non-disease gene set does not exist) to build classifiers that identify new disease genes among the unknown genes. However, such classifiers are actually built from a noisy negative set N, because N itself can contain disease genes that have not yet been discovered. As a result, the classifiers do not perform as well as they could.

RESULTS Instead of treating the unknown genes as negative examples in N, we treat them as an unlabeled set U. We design a novel positive-unlabeled (PU) learning algorithm, PUDI (PU learning for disease gene identification), to build a classifier using P and U. We first partition U into four sets: the reliable negative set RN, the likely positive set LP, the likely negative set LN and the weak negative set WN. Weighted support vector machines are then used to build a multi-level classifier from these four sets and the positive training set P to identify disease genes. Our experimental results demonstrate that the proposed PUDI algorithm significantly outperformed the existing methods.

CONCLUSION The proposed PUDI algorithm identifies disease genes more accurately by treating the unknown data more appropriately, as an unlabeled set U instead of a negative set N. Given that many machine learning problems in biomedical research involve positive and unlabeled data rather than negative data, the machine learning methods for these problems could possibly be improved further by adopting PU learning methods, as we have done here for disease gene identification.

AVAILABILITY AND IMPLEMENTATION The executable program and data are available at http://www1.i2r.a-star.edu.sg/~xlli/PUDI/PUDI.html.
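The abstract does not say how U is partitioned or how the weights are chosen, so the following is only a minimal sketch of the general idea under loudly stated assumptions, not the PUDI implementation. It assumes that genes are already encoded as numeric feature vectors, that quartiles of distance to the centroid of P can stand in for the partitioning rule, and that a single weighted linear SVM (trained here by plain subgradient descent) can stand in for the multi-level classifier. The function names, the weight values and the synthetic data are all hypothetical.

```python
import random

def centroid(rows):
    """Component-wise mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def partition_unlabeled(P, U):
    """Split U into RN, LN, WN and LP.

    ASSUMPTION: quartiles of distance to the positive centroid replace the
    paper's partitioning rule, which the abstract does not describe.
    """
    c = centroid(P)
    ranked = sorted(U, key=lambda u: sq_dist(u, c), reverse=True)  # farthest first
    q = len(ranked) // 4
    return {
        "RN": ranked[:q],           # reliable negatives: farthest from P
        "LN": ranked[q:2 * q],      # likely negatives
        "WN": ranked[2 * q:3 * q],  # weak negatives
        "LP": ranked[3 * q:],       # likely positives: closest to P
    }

# HYPOTHETICAL (label, weight) per set: confident sets get larger weights.
SET_CONFIG = {"P": (1, 1.0), "LP": (1, 0.3),
              "RN": (-1, 1.0), "LN": (-1, 0.6), "WN": (-1, 0.3)}

def build_training_set(P, parts):
    """Flatten P and the four partitions into features, labels and weights."""
    X, y, weights = [], [], []
    for name, rows in [("P", P)] + list(parts.items()):
        label, weight = SET_CONFIG[name]
        for row in rows:
            X.append(row)
            y.append(label)
            weights.append(weight)
    return X, y, weights

def train_weighted_linear_svm(X, y, weights, lam=0.01, lr=0.1, epochs=50, seed=0):
    """Weighted hinge loss with L2 regularisation, by stochastic subgradient descent."""
    rng = random.Random(seed)
    w, b = [0.0] * len(X[0]), 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            w = [wj * (1 - lr * lam) for wj in w]  # regularisation shrinkage
            if margin < 1:  # margin violated: step scaled by the example weight
                step = lr * weights[i] * y[i]
                w = [wj + step * xj for wj, xj in zip(w, X[i])]
                b += step
    return w, b

# Synthetic stand-in data (NOT gene data): U hides 10 positives among 30 negatives.
rng = random.Random(0)

def blob(center, n):
    return [[rng.gauss(center, 0.5), rng.gauss(center, 0.5)] for _ in range(n)]

P = blob(2.0, 20)
U = blob(2.0, 10) + blob(-2.0, 30)

parts = partition_unlabeled(P, U)
X, y, weights = build_training_set(P, parts)
w, b = train_weighted_linear_svm(X, y, weights)

def score(x):
    """Signed decision value: positive means 'predicted disease gene'."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

print({name: len(rows) for name, rows in parts.items()})
print(score([2.0, 2.0]) > 0, score([-2.0, -2.0]) < 0)
```

The point of the per-example weights is that a mistake on an example from RN or P costs more than a mistake on an example from WN or LP, so a mislabeled gene in a low-confidence set distorts the decision boundary less. With a library SVM the same effect is usually obtained through per-sample weights, for example the `sample_weight` argument of scikit-learn's `SVC.fit`.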


Similar resources

Ensemble Positive Unlabeled Learning for Disease Gene Identification

An increasing number of genes have been experimentally confirmed in recent years as causative genes for various human diseases. The newly available knowledge can be exploited by machine learning methods to discover additional unknown genes that are likely to be associated with diseases. In particular, positive unlabeled learning (PU learning) methods, which require only a positive training set P...


Computational Approaches for Disease Gene Identification

Identifying disease genes from the human genome is an important and fundamental problem in biomedical research. Although many publications have applied machine learning methods to discover new disease genes, the problem remains challenging because of the pleiotropy of genes, the limited number of confirmed disease genes in the whole genome and the genetic heterogeneity of diseases. Recent approaches have ...


Disease gene prediction using a one-class support vector machine classifier

Abstract: In disease gene identification and classification, users are interested in classifying only one specific class, disease genes, without considering other classes (non-disease genes). This situation is referred to as one-class classification. Existing machine learning-based methods typically use known disease genes as the positive training set and unknown genes as the negative training set to bu...


Semi-Supervised Ranking for Re-identification with Few Labeled Image Pairs

In many person re-identification applications, typically only a small number of labeled image pairs are available for training. To address this serious practical issue, we propose a novel semi-supervised ranking method which makes use of unlabeled data to improve the re-identification performance. It is shown that the low density separation or graph propagation assumption is not valid under some con...


Learning model order from labeled and unlabeled data for partially supervised classification, with application to word sense disambiguation

Previous partially supervised classification methods can partition unlabeled data into positive examples and negative examples for a given class by learning from positive labeled examples and unlabeled examples, but they cannot further group the negative examples into meaningful clusters even if there are many different classes in the negative examples. Here we propose an automatic method to o...



Journal title:

Volume 28, issue

Pages -

Publication date: 2012
